Digitised historical text: Does it have to be mediOCRe?
Abstract
This paper reports on experiments to improve the Optical Character Recognition (OCR) quality of historical text as a preliminary step in text mining. We analyse the quality of OCRed text compared to a gold standard and show how it can be improved by performing two automatic correction steps. We also demonstrate the impact this can have on named entity recognition in a preliminary extrinsic evaluation. This work was performed as part of the Trading Consequences project, which is focussed on text mining of historical documents for the study of nineteenth-century trade in the British Empire.
Similar resources
VARD 2: A tool for dealing with spelling variation in historical corpora
Spelling variation causes considerable problems for corpus linguistic techniques such as frequency analysis, concordancing and automatic tagging, with a significant impact being made on recall and the accuracy of results [1]. This paper will focus on Early Modern English, the most recent period of the English language to include a large amount of inconsistent spelling. Although many corpora of ...
iDoc: Interactive Analysis, Transcription and Translation of Old Text Documents TIN2006-15694-C02
There are huge historical document collections residing in libraries, museums and archives that are currently being digitised for preservation purposes and to make them available worldwide through large, on-line digital libraries. The main objective, however, is not to simply provide access to raw images of digitised documents, but to annotate them with their real informative content and, in pa...
Named Entity Recognition for Digitised Historical Texts
We describe and evaluate a prototype system for recognising person and place names in digitised records of British parliamentary proceedings from the late 17th and early 19th centuries. The output of an OCR engine is the input for our system and we describe certain issues and errors in this data and discuss the methods we have used to overcome the problems. We describe our rule-based named enti...
Margins are more important than text, Historical values of margins, memorial notes and colophons of Manuscripts in Zoroastrian tradition
In the Zoroastrian tradition, the most important challenge and the most ambiguous issue is ambiguity in history and neglect of time and chronology. Perhaps, this approach that historical time is limited and the beginning and end of time is clear and the goodness will be conqueror eventually; it is because of ambiguity of history in Zoroastrian tradition. Since early time to now, the Zoroastrian re...
Poetic Affection in Historical prose of Nafsat ol-Masdoor
Nafsat ol-Masdoor is one of the outstanding artificial and technical texts in prose that remained from the Mongol era. This text was written not only to record historical events; the writer also intended to explain his biography and problems. The writer uses various literary arts to affect the addressee and one of these componen...
Publication date: 2012